library("knitr") # for knitting RMarkdown
library("kableExtra") # for making nice tables
library("janitor") # for cleaning column names
library("broom") # for tidying up linear models
library("corrr") # for calculating correlations between many variables
library("corrplot") # for plotting correlations
library("GGally") # for running ggpairs() function
library("tidyverse") # for wrangling, plotting, etc.
# include references for used packages
knitr::write_bib(.packages(), "packages.bib") Let’s load the data sets that we’ll explore in this class:
# credit data set
df.credit = read_csv("data/credit.csv") %>%
rename(index = X1) %>%
clean_names()
# advertising data set
df.ads = read_csv("data/advertising.csv") %>%
clean_names() %>%
rename(index = x1)| variable | description |
|---|---|
| income | in thousand dollars |
| limit | credit limit |
| rating | credit rating |
| cards | number of credit cards |
| age | in years |
| education | years of education |
| gender | male or female |
| student | student or not |
| married | married or not |
| ethnicity | African American, Asian, Caucasian |
| balance | average credit card debt |
Let’s take a look at a case where we have multiple continuous predictor variables. In this case, we want to make sure that our predictors are not too highly correlated with each other (as this makes the interpration of how much each variable explains the outcome difficult). So we first need to explore the pairwise correlations between variables.
The corrr package is great for exploring correlations between variables. To find out more how corrr works, take a look at this vignette:
Here is an example that illustrates some of the key functions in the corrr package (using the advertisement data):
#> rowname index tv radio newspaper sales
#> 1 index
#> 2 tv .02
#> 3 radio -.11 .05
#> 4 newspaper -.15 .06 .35
#> 5 sales -.05 .78 .58 .23
df.credit %>%
select_if(is.numeric) %>%
correlate(quiet = T) %>%
select(rowname, income) %>%
mutate(rowname = reorder(rowname, income)) %>%
drop_na() %>%
ggplot(aes(x = rowname,
y = income,
fill = income)) +
geom_hline(yintercept = 0) +
geom_col(color = "black",
show.legend = F) +
scale_fill_gradient2(low = "indianred2",
mid = "white",
high = "skyblue1",
limits = c(-1, 1)) +
coord_flip() +
theme(axis.title.y = element_blank())Figure 11.1: Bar plot illustrating how strongly different variables correlate with income.
tmp = df.credit %>%
select_if(is.numeric) %>%
correlate(diagonal = 0,
quiet = T) %>%
rearrange() %>%
column_to_rownames() %>%
as.matrix() %>%
corrplot()#> Registered S3 method overwritten by 'seriation':
#> method from
#> reorder.hclust gclus
Figure 11.2: Pairwise correlations with scatter plots, correlation values, and densities on the diagonal.
With some customization:
df.ads %>%
select(-index) %>%
ggpairs(lower = list(continuous = wrap("points",
alpha = 0.3)),
upper = list(continuous = wrap("cor", size = 8))) +
theme(panel.grid.major = element_blank())Figure 11.3: Pairwise correlations with scatter plots, correlation values, and densities on the diagonal (customized).
Now that we’ve explored the correlations, let’s have a go at the multiple regression.
We’ll first take another look at the pairwise relationships:
tmp.x = "tv"
# tmp.x = "radio"
# tmp.x = "newspaper"
# tmp.y = "radio"
tmp.y = "radio"
# tmp.y = "tv"
ggplot(df.ads,
aes_string(x = tmp.x, y = tmp.y)) +
stat_smooth(method = "lm",
color = "black",
fullrange = T) +
geom_point(alpha = 0.3) +
annotate(geom = "text",
x = -Inf,
y = Inf,
hjust = -0.5,
vjust = 1.5,
label = str_c("r = ", cor(df.ads[[tmp.x]], df.ads[[tmp.y]]) %>%
round(2) %>% # round
str_remove("^0+") # remove 0
),
size = 8) +
theme(text = element_text(size = 30))TV ads and radio ads aren’t correlated. Yay!
Let’s see whether adding radio ads is worth it (over and above having TV ads).
# fit the models
fit_c = lm(sales ~ 1 + tv, data = df.ads)
fit_a = lm(sales ~ 1 + tv + radio, data = df.ads)
# do the F test
anova(fit_c, fit_a)#> Analysis of Variance Table
#>
#> Model 1: sales ~ 1 + tv
#> Model 2: sales ~ 1 + tv + radio
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 198 2102.53
#> 2 197 556.91 1 1545.6 546.74 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It’s worth it!
Let’s evaluate how well the model actually does. We do this by taking a look at the residual plot, and check whether the residuals are normally distributed.
tmp.fit = lm(sales ~ 1 + tv + radio, data = df.ads)
df.plot = tmp.fit %>%
augment() %>%
clean_names()
# residual plot
ggplot(df.plot,
aes(x = fitted,
y = resid)) +
geom_point()
# density of residuals
ggplot(df.plot,
aes(x = resid)) +
stat_density(geom = "line")
# QQ plot
ggplot(df.plot,
aes(sample = resid)) +
geom_qq() +
geom_qq_line() There is a slight non-linear trend in the residuals. We can also see that the residuals aren’t perfectly normally distributed. We’ll see later what we can do about this …
Let’s see how well the model does overall:
fit_a %>%
glance() %>%
kable(digits = 3) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.897 | 0.896 | 1.681 | 859.618 | 0 | 3 | -386.197 | 780.394 | 793.587 | 556.914 | 197 |
As we can see, the model almost explains 90% of the variance. That’s very decent!
Here is a way of visualizing how both tv ads and radio ads affect sales:
df.plot = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
augment() %>%
clean_names()
df.tidy = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
tidy()
ggplot(df.plot, aes(x = radio, y = sales, color = tv)) +
geom_point() +
scale_color_gradient(low = "gray80", high = "black") +
theme(legend.position = c(0.1, 0.8))We used color here to encode TV ads (and the x-axis for the radio ads).
In addition, we might want to illustrate what relationship between radio ads and sales the model predicts for three distinct values for TV ads. Like so:
df.plot = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
augment() %>%
clean_names()
df.tidy = lm(sales ~ 1 + tv + radio, data = df.ads) %>%
tidy()
ggplot(df.plot, aes(x = radio, y = sales, color = tv)) +
geom_point() +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 200,
slope = df.tidy$estimate[3]) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 100,
slope = df.tidy$estimate[3]) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[2] * 0,
slope = df.tidy$estimate[3]) +
scale_color_gradient(low = "gray80", high = "black") +
theme(legend.position = c(0.1, 0.8))Fitting the augmented model yields the following estimates for the coefficients in the model:
fit_a %>%
tidy(conf.int = T) %>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 2.92 | 0.29 | 9.92 | 0 | 2.34 | 3.50 |
| tv | 0.05 | 0.00 | 32.91 | 0 | 0.04 | 0.05 |
| radio | 0.19 | 0.01 | 23.38 | 0 | 0.17 | 0.20 |
One thing we can do to make different predictors more comparable is to standardize them.
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> Please use a list of either functions or lambdas:
#>
#> # Simple named list:
#> list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`:
#> tibble::lst(mean, median)
#>
#> # Using lambdas
#> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> This warning is displayed once per session.
df.ads %>%
select(-newspaper) %>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)| index | tv | radio | sales | tv_scaled | radio_scaled |
|---|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 22.1 | 0.97 | 0.98 |
| 2 | 44.5 | 39.3 | 10.4 | -1.19 | 1.08 |
| 3 | 17.2 | 45.9 | 9.3 | -1.51 | 1.52 |
| 4 | 151.5 | 41.3 | 18.5 | 0.05 | 1.21 |
| 5 | 180.8 | 10.8 | 12.9 | 0.39 | -0.84 |
| 6 | 8.7 | 48.9 | 7.2 | -1.61 | 1.73 |
| 7 | 57.5 | 32.8 | 11.8 | -1.04 | 0.64 |
| 8 | 120.2 | 19.6 | 13.2 | -0.31 | -0.25 |
| 9 | 8.6 | 2.1 | 4.8 | -1.61 | -1.43 |
| 10 | 199.8 | 2.6 | 10.6 | 0.61 | -1.39 |
We can standardize (z-score) variables using the scale() function.
# tmp.variable = "tv"
tmp.variable = "tv_scaled"
ggplot(df.ads,
aes_string(x = tmp.variable)) +
stat_density(geom = "line",
size = 1) +
annotate(geom = "text",
x = median(df.ads[[tmp.variable]]),
y = -Inf,
label = str_c("sd = ", sd(df.ads[[tmp.variable]]) %>% round(2)),
size = 10,
vjust = -1,
hjust = 0.5
) +
annotate(geom = "text",
x = median(df.ads[[tmp.variable]]),
y = -Inf,
label = str_c("mean = ", mean(df.ads[[tmp.variable]]) %>% round(2)),
size = 10,
vjust = -3,
hjust = 0.5
)Scaling a variable leaves the distribution intact, but changes the mean to 0 and the SD to 1.
Let’s compare a compact model that only predicts the mean, with a model that uses the student variable as an additional predictor.
# fit the models
fit_c = lm(balance ~ 1, data = df.credit)
fit_a = lm(balance ~ 1 + student, data = df.credit)
# run the F test
anova(fit_c, fit_a)#> Analysis of Variance Table
#>
#> Model 1: balance ~ 1
#> Model 2: balance ~ 1 + student
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 399 84339912
#> 2 398 78681540 1 5658372 28.622 1.488e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Call:
#> lm(formula = balance ~ 1 + student, data = df.credit)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -876.82 -458.82 -40.87 341.88 1518.63
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 480.37 23.43 20.50 < 2e-16 ***
#> studentYes 396.46 74.10 5.35 1.49e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 444.6 on 398 degrees of freedom
#> Multiple R-squared: 0.06709, Adjusted R-squared: 0.06475
#> F-statistic: 28.62 on 1 and 398 DF, p-value: 1.488e-07
The summary() shows that it’s worth it: the augmented model explains a signifcant amount of the variance (i.e. it significantly reduces the proportion in error PRE).
Let’s visualize the model predictions. Here is the compact model:
ggplot(df.credit,
aes(x = index,
y = balance)) +
geom_hline(yintercept = mean(df.credit$balance),
size = 1) +
geom_segment(aes(xend = index,
yend = mean(df.credit$balance)),
alpha = 0.1) +
geom_point(alpha = 0.5) It just predicts the mean (the horizontal black line). The vertical lines from each data point to the mean illustrate the residuals.
And here is the augmented model:
df.fit = fit_a %>%
tidy() %>%
mutate(estimate = round(estimate,2))
ggplot(df.credit,
aes(x = index,
y = balance,
color = student)) +
geom_hline(yintercept = df.fit$estimate[1],
size = 1,
color = "#E41A1C") +
geom_hline(yintercept = df.fit$estimate[1] + df.fit$estimate[2],
size = 1,
color = "#377EB8") +
geom_segment(data = df.credit %>%
filter(student == "No"),
aes(xend = index,
yend = df.fit$estimate[1]),
alpha = 0.1,
color = "#E41A1C") +
geom_segment(data = df.credit %>%
filter(student == "Yes"),
aes(xend = index,
yend = df.fit$estimate[1] + df.fit$estimate[2]),
alpha = 0.1,
color = "#377EB8") +
geom_point(alpha = 0.5) +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(reverse = T))Note that this model predicts two horizontal lines. One for students, and one for non-students.
Let’s make simple plot that shows the means of both groups with bootstrapped confidence intervals.
ggplot(data = df.credit,
mapping = aes(x = student, y = balance, fill = student)) +
stat_summary(fun.y = "mean",
geom = "bar",
color = "black",
show.legend = F) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1) +
scale_fill_brewer(palette = "Set1")And let’s double check that we also get a signifcant result when we run a t-test instead of our model comparison procedure:
t.test(x = df.credit$balance[df.credit$student == "No"],
y = df.credit$balance[df.credit$student == "Yes"])#>
#> Welch Two Sample t-test
#>
#> data: df.credit$balance[df.credit$student == "No"] and df.credit$balance[df.credit$student == "Yes"]
#> t = -4.9028, df = 46.241, p-value = 1.205e-05
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -559.2023 -233.7088
#> sample estimates:
#> mean of x mean of y
#> 480.3694 876.8250
When we put a variable in a linear model that is coded as a character or as a factor, R automatically recodes this variable using dummy coding. It uses level 1 as the reference category for factors, or the value that comes first in the alphabet for characters.
df.credit %>%
select(income, student) %>%
mutate(student_dummy = ifelse(student == "No", 0, 1))%>%
head(10) %>%
kable(digits = 2) %>%
kable_styling(bootstrap_options = "striped",
full_width = F)| income | student | student_dummy |
|---|---|---|
| 14.89 | No | 0 |
| 106.03 | Yes | 1 |
| 104.59 | No | 0 |
| 148.92 | No | 0 |
| 55.88 | No | 0 |
| 80.18 | No | 0 |
| 21.00 | No | 0 |
| 71.41 | No | 0 |
| 15.12 | No | 0 |
| 71.06 | Yes | 1 |
To report the results, we could show a plot like this:
df.plot = df.credit
ggplot(df.plot,
aes(x = student,
y = balance)) +
geom_point(alpha = 0.1,
position = position_jitter(height = 0, width = 0.1)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "linerange",
size = 1) +
stat_summary(fun.y = "mean",
geom = "point",
size = 4)And then report the means and standard deviations together with the result of our signifance test:
df.credit %>%
group_by(student) %>%
summarize(mean = mean(balance),
sd = sd(balance)) %>%
mutate_if(is.numeric, funs(round(., 2)))#> # A tibble: 2 x 3
#> student mean sd
#> <chr> <dbl> <dbl>
#> 1 No 480. 439.
#> 2 Yes 877. 490
Now let’s take a look at a case where we have one continuous and one categorical predictor variable. Let’s first formulate and fit our models:
# fit the models
fit_c = lm(balance ~ 1 + income, df.credit)
fit_a = lm(balance ~ 1 + income + student, df.credit)
# run the F test
anova(fit_c, fit_a)#> Analysis of Variance Table
#>
#> Model 1: balance ~ 1 + income
#> Model 2: balance ~ 1 + income + student
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 398 66208745
#> 2 397 60939054 1 5269691 34.331 9.776e-09 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see again that it’s worth it. The augmented model explains significantly more variance than the compact model.
Let’s visualize the model predictions again. Let’s start with the compact model:
df.augment = fit_c %>%
augment() %>%
clean_names()
ggplot(df.augment,
aes(x = income,
y = balance)) +
geom_smooth(method = "lm", se = F, color = "black") +
geom_segment(aes(xend = income,
yend = fitted),
alpha = 0.3) +
geom_point(alpha = 0.3)This time, the compact model still predicts just one line (like above) but note that this line is not horizontal anymore.
df.tidy = fit_a %>%
tidy() %>%
mutate(estimate = round(estimate,2))
df.augment = fit_a %>%
augment() %>%
clean_names()
ggplot(df.augment,
aes(x = income,
y = balance,
group = student,
color = student)) +
geom_segment(data = df.augment %>%
filter(student == "No"),
aes(xend = income,
yend = fitted),
color = "#E41A1C",
alpha = 0.3) +
geom_segment(data = df.augment %>%
filter(student == "Yes"),
aes(xend = income,
yend = fitted),
color = "#377EB8",
alpha = 0.3) +
geom_abline(intercept = df.tidy$estimate[1],
slope = df.tidy$estimate[2],
color = "#E41A1C",
size = 1) +
geom_abline(intercept = df.tidy$estimate[1] + df.tidy$estimate[3],
slope = df.tidy$estimate[2],
color = "#377EB8",
size = 1) +
geom_point(alpha = 0.3) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = c(0.1, 0.9)) +
guides(color = guide_legend(reverse = T))The augmented model predicts two lines again, each with the same slope (but the intercept differs).
Let’s check whether there is an interaction between how income affects balance for students vs. non-students.
Let’s take a look at the data first.
ggplot(data = df.credit,
mapping = aes(x = income,
y = balance,
group = student,
color = student)) +
geom_smooth(method = "lm", se = F) +
geom_point(alpha = 0.3) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = c(0.1, 0.9)) +
guides(color = guide_legend(reverse = T))Note that we just specified here that we want to have a linear model (via geom_smooth(method = "lm")). By default, ggplot2 assumes that we want a model that includes interactions. We can see this by the fact that two fitted lines are not parallel.
But is the interaction in the model worth it? That is, does a model that includes an interaction explain significantly more variance in the data, than a model that does not have an interaction.
Let’s check:
# fit models
fit_c = lm(formula = balance ~ income + student, data = df.credit)
fit_a = lm(formula = balance ~ income * student, data = df.credit)
# F-test
anova(fit_c, fit_a)#> Analysis of Variance Table
#>
#> Model 1: balance ~ income + student
#> Model 2: balance ~ income * student
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 397 60939054
#> 2 396 60734545 1 204509 1.3334 0.2489
Nope, not worth it! The F-test comes out non-significant.
Information about this R session including which version of R was used, and what packages were loaded.
#> R version 3.6.2 (2019-12-12)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
#> [5] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 tidyverse_1.3.0
#> [9] GGally_1.4.0 ggplot2_3.2.1 corrplot_0.84 corrr_0.4.0
#> [13] broom_0.5.3 janitor_1.2.0 kableExtra_1.1.0 knitr_1.26
#>
#> loaded via a namespace (and not attached):
#> [1] colorspace_1.4-1 ellipsis_0.3.0 snakecase_0.11.0
#> [4] htmlTable_1.13.3 base64enc_0.1-3 fs_1.3.1
#> [7] rstudioapi_0.10 farver_2.0.1 fansi_0.4.0
#> [10] lubridate_1.7.4 xml2_1.2.2 codetools_0.2-16
#> [13] splines_3.6.2 zeallot_0.1.0 Formula_1.2-3
#> [16] jsonlite_1.6 cluster_2.1.0 dbplyr_1.4.2
#> [19] png_0.1-7 compiler_3.6.2 httr_1.4.1
#> [22] backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18
#> [25] lazyeval_0.2.2 cli_2.0.0 acepack_1.4.1
#> [28] htmltools_0.4.0 tools_3.6.2 gtable_0.3.0
#> [31] glue_1.3.1 reshape2_1.4.3 Rcpp_1.0.3
#> [34] cellranger_1.1.0 vctrs_0.2.1 gdata_2.18.0
#> [37] nlme_3.1-142 iterators_1.0.12 xfun_0.11
#> [40] rvest_0.3.5 lifecycle_0.1.0 gtools_3.8.1
#> [43] dendextend_1.13.2 MASS_7.3-51.4 scales_1.1.0
#> [46] TSP_1.1-7 hms_0.5.2 RColorBrewer_1.1-2
#> [49] yaml_2.2.0 gridExtra_2.3 rpart_4.1-15
#> [52] reshape_0.8.8 latticeExtra_0.6-29 stringi_1.4.3
#> [55] highr_0.8 gclus_1.3.2 foreach_1.4.7
#> [58] checkmate_1.9.4 seriation_1.2-8 caTools_1.17.1.3
#> [61] rlang_0.4.2 pkgconfig_2.0.3 bitops_1.0-6
#> [64] evaluate_0.14 lattice_0.20-38 htmlwidgets_1.5.1
#> [67] labeling_0.3 tidyselect_0.2.5 plyr_1.8.5
#> [70] magrittr_1.5 bookdown_0.16 R6_2.4.1
#> [73] gplots_3.0.1.1 generics_0.0.2 Hmisc_4.3-0
#> [76] DBI_1.1.0 pillar_1.4.3 haven_2.2.0
#> [79] foreign_0.8-72 withr_2.1.2 survival_3.1-8
#> [82] nnet_7.3-12 modelr_0.1.5 crayon_1.3.4
#> [85] utf8_1.1.4 KernSmooth_2.23-16 rmarkdown_2.0
#> [88] viridis_0.5.1 jpeg_0.1-8.1 grid_3.6.2
#> [91] readxl_1.3.1 data.table_1.12.8 reprex_0.3.0
#> [94] digest_0.6.23 webshot_0.5.2 munsell_0.5.0
#> [97] registry_0.5-1 viridisLite_0.3.0
Firke, Sam. 2019. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Henry, Lionel, and Hadley Wickham. 2019. Purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
Müller, Kirill, and Hadley Wickham. 2019. Tibble: Simple Data Frames. https://CRAN.R-project.org/package=tibble.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David, and Alex Hayes. 2019. Broom: Convert Statistical Analysis Objects into Tidy Tibbles. https://CRAN.R-project.org/package=broom.
Ruiz, Edgar, Simon Jackson, and Jorge Cimentada. 2019. Corrr: Correlations in R. https://CRAN.R-project.org/package=corrr.
Schloerke, Barret, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Joseph Larmarange. 2018. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Wei, Taiyun, and Viliam Simko. 2017a. Corrplot: Visualization of a Correlation Matrix. https://CRAN.R-project.org/package=corrplot.
———. 2017b. R Package "Corrplot": Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2019a. Forcats: Tools for Working with Categorical Variables (Factors). https://CRAN.R-project.org/package=forcats.
———. 2019b. Stringr: Simple, Consistent Wrappers for Common String Operations. https://CRAN.R-project.org/package=stringr.
———. 2019c. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, and Hiroaki Yutani. 2019. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2019. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Wickham, Hadley, and Lionel Henry. 2019. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.
Wickham, Hadley, Jim Hester, and Romain Francois. 2018. Readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.name/knitr/.
———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.